2 What is Bayesian statistical inference?
2.1 Introduction
2.1.1 Today’s topics
- What is Bayesian statistical inference?
- Why is it useful in general?
- Why is it useful in systems biology?
- The big challenge
2.1.2 Computer goals
Set up git/ssh, python, cmdstanpy and cmdstan
2.2 What is Bayesian statistical inference?
2.2.1 Probability function
Analogy: a function that measures how much of the water in a jug is in any given part.
i.e. \(p: \mathcal{F} \rightarrow [0,1]\), where \(\mathcal{F}\) is a collection of subsets ("events") of a sample space \(S\), such that:
- \(p(S) = 1\)
- for disjoint \(A, B \subseteq S\), \(p(A \cup B) = p(A) + p(B)\)
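These axioms are easy to check directly for a small discrete example. A sketch with made-up probability masses (the sample space, `p_mass` and the events are all invented for illustration; `Fraction` avoids floating-point noise):

```python
from fractions import Fraction

# a probability function on the tiny sample space S = {1, 2, 3},
# represented as a map from each outcome to its probability mass
p_mass = {1: Fraction(1, 2), 2: Fraction(3, 10), 3: Fraction(1, 5)}

def p(event):
    """Probability of an event, i.e. a subset of S."""
    return sum(p_mass[outcome] for outcome in event)

S = {1, 2, 3}
A, B = {1}, {2, 3}              # disjoint events

assert p(S) == 1                # p(S) = 1
assert p(A | B) == p(A) + p(B)  # additivity for disjoint A and B
```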
2.2.2 Statistical Inference
In: facts about a spoonful sample
Out: propositions about a soup population
e.g.
- spoonful not salty \(\rightarrow\) soup not salty
- no carrots in spoon \(\rightarrow\) no carrots in soup
2.2.3 Bayesian statistical inference
Statistical inference resulting in a probability.
e.g.
- spoon \(\rightarrow\) \(p(\text{soup not salty})\) = 99.9%
- spoon \(\rightarrow\) \(p(\text{no carrots in soup})\) = 95.1%
Non-Bayesian inferences:
- spoon \(\rightarrow\) Best estimate of [salt] is 0.1 mol/l
- \(p_{null}(\text{spoon})\) = 4.9% \(\rightarrow\) no carrots (p=0.049)
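To sketch how the spoon-to-probability step might work numerically, here is a toy grid approximation; the measurement value, noise level and flat prior are all invented for illustration:

```python
import math

# candidate values for theta = [salt] in mol/l, on a grid
grid = [i / 1000 for i in range(1, 300)]
prior = [1 / len(grid)] * len(grid)   # flat prior (an assumption)
y, sigma = 0.08, 0.03                 # made-up spoonful measurement and noise

def lik(theta):
    """Unnormalised normal likelihood of the spoonful measurement."""
    return math.exp(-0.5 * ((y - theta) / sigma) ** 2)

unnorm = [pr * lik(th) for pr, th in zip(prior, grid)]
total = sum(unnorm)
post = [u / total for u in unnorm]

# Bayesian output: a probability for a proposition about the soup
p_not_salty = sum(pp for th, pp in zip(grid, post) if th < 0.1)
print(f"p([salt] < 0.1 | spoon) = {p_not_salty:.3f}")
```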
3 Why is Bayesian statistical inference useful in general?
3.0.1 The philosophical reason
Bayesian inference can be interpreted in terms of information and plausible reasoning.
e.g. “According to the model…”
- “…x is highly plausible.”
- “…x is more plausible than y.”
- “…the data doesn’t contain enough information for firm conclusions about x.”
3.0.2 Mathematical reason
Bayesian inference is old!
This means
- it is well understood mathematically.
- conceptual surprises are relatively rare.
- there are many compatible frameworks.
3.0.3 General practical reason
Probabilities decompose nicely:
\[ p(\theta, y) = p(\theta)p(y\mid\hat{y}(\theta)) \]
- \(p(\theta)\): nice form for background information, e.g. anything non-experimental
- \(\hat{y}(\theta)\): nice form for structural information, e.g. physical laws
- \(p(y\mid\hat{y}(\theta))\): nice form for measurement information, e.g. instrument accuracy
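The decomposition can be written directly as code. A sketch with invented ingredients: a standard-normal prior, \(\hat{y}(\theta) = 2\theta\) as a stand-in for structural information, and normal measurement noise:

```python
import math

def log_norm_pdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def y_hat(theta):
    """Structural information, e.g. a physical law (invented here)."""
    return 2 * theta

def log_joint(theta, y):
    log_prior = log_norm_pdf(theta, 0, 1)         # background information
    log_lik = log_norm_pdf(y, y_hat(theta), 0.5)  # measurement information
    return log_prior + log_lik

print(log_joint(0.3, 0.7))
```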
3.1 Why is Bayesian inference useful in systems biology?
3.1.1 Regression models: good for describing measurements
Idea: the measured value depends systematically but noisily on the true value, e.g.
\(y \sim N(\hat{y}, \sigma)\)
Bayesian inference lends itself to regression models that accurately describe the details of the measurement process, e.g.
- heteroskedasticity \(y \sim N(\hat{y}, \sigma(\hat{y}))\)
- non-negativity \(y \sim LN(\ln{\hat{y}}, \sigma)\) (also compositionality)
- unknown bias \(y \sim N(\hat{y} + q, \sigma)\)
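These measurement models are easy to simulate with the standard library. A sketch with invented values for \(\hat{y}\), \(\sigma\) and \(q\):

```python
import math
import random

random.seed(12345)
y_hat, sigma, q = 2.0, 0.1, 0.3   # invented true value, noise and bias

# basic: y ~ N(y_hat, sigma)
y_normal = random.gauss(y_hat, sigma)

# non-negative: y ~ LN(ln(y_hat), sigma); exp of a normal is always positive
y_lognormal = random.lognormvariate(math.log(y_hat), sigma)

# unknown bias: y ~ N(y_hat + q, sigma)
y_biased = random.gauss(y_hat + q, sigma)

print(y_normal, y_lognormal, y_biased)
```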
3.1.2 Multi-level models: good for describing sources of variation
Measurement model:
\(y \sim binomial(K, logit^{-1}(ability))\)
GPareto model:
\(ability \sim GPareto(m, k, s)\)
Normal model:
\(ability \sim N(\mu, \tau)\)
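The normal-population version can be simulated in a few lines (all numbers invented): each subject's latent ability is drawn from the population distribution, then measured noisily via binomial trials.

```python
import math
import random

random.seed(1)
mu, tau, K = 0.0, 1.0, 20   # population mean/sd and trials per subject

def inv_logit(x):
    """Map a latent ability to a success probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

abilities = [random.gauss(mu, tau) for _ in range(5)]
successes = [
    sum(random.random() < inv_logit(a) for _ in range(K)) for a in abilities
]
print(successes)   # one binomial outcome per subject
```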
3.1.3 Generative models: good for representing structural information
Information about hares (\(u\)) and lynxes (\(v\)):
\[\begin{align*} \frac{d}{dt}u &= (\alpha - \beta v)u \\ \frac{d}{dt}v &= (-\gamma + \delta u)v \end{align*}\]
i.e. a deterministic function turning \(\alpha\), \(\beta\), \(\gamma\), \(\delta\), \(u(0)\) and \(v(0)\) into \(u(t)\) and \(v(t)\).
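A minimal forward simulation of these equations, using fixed-step Euler updates; the parameter values and initial populations are illustrative, not fitted to any data:

```python
def simulate(alpha, beta, gamma, delta, u0, v0, t_end=10.0, dt=0.001):
    """Euler-step the predator-prey ODEs from t=0 to t=t_end."""
    u, v, t = u0, v0, 0.0
    while t < t_end:
        du = (alpha - beta * v) * u     # hare growth minus predation
        dv = (-gamma + delta * u) * v   # lynx death plus predation gain
        u, v = u + dt * du, v + dt * dv
        t += dt
    return u, v

u_t, v_t = simulate(alpha=0.55, beta=0.028, gamma=0.84, delta=0.026,
                    u0=33.9, v0=5.9)
print(u_t, v_t)   # hare and lynx populations at t = 10
```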
3.2 The big challenge
3.2.1 The big challenge
\(p(\theta \mid y)\) is easy to evaluate but hard to integrate.
This is bad as we typically want something like
\[ p([salt] < 0.1 \mid spoon = s) \]
which is equivalent to
\[ \int_{0}^{0.1}p([salt] \mid spoon = s)\,d[salt] \]
The domain of \(p(\theta \mid y)\) has one dimension per model parameter, so these integrals are typically high-dimensional.
3.2.2 The solution: MCMC
Strategy:
- Find a series of numbers that
- quickly finds the high-probability region in parameter space
- reliably matches its statistical properties
- Do sample-based approximate integration.
It (often) works!
We can tell when it doesn’t work!
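Once we have draws from the posterior, the awkward integral becomes a simple average. A sketch where the "posterior draws" are faked with a known normal distribution, standing in for MCMC output:

```python
import random

random.seed(0)
# stand-in for MCMC output: draws from a hypothetical posterior N(0.08, 0.03)
draws = [random.gauss(0.08, 0.03) for _ in range(10_000)]

# p([salt] < 0.1 | spoon) is approximately the fraction of draws below 0.1
p_not_salty = sum(d < 0.1 for d in draws) / len(draws)
print(round(p_not_salty, 3))
```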
3.3 Homework
3.3.1 Things to read
Box and Tiao (1992) Ch. 1.1 (available from DTU Findit) gives a nice explanation of statistical inference in general and of why the Bayesian approach is attractive.
Historical interest:
3.3.2 Things to set up
3.3.2.1 Python
First get a recent (ideally 3.10+) version of Python. This can be very annoying, so talk to me if necessary!
Next get used to Python virtual environments.
The method I like is to put the virtual environment in a folder .venv inside the root of my project:
$ python -m venv .venv --prompt=bssb
Then to use:
$ source .venv/bin/activate  # tip: alias va="source .venv/bin/activate"
# ... do work
$ deactivate
3.3.2.2 Git and ssh
git clone git@github.com:teddygroves/bayesian_statistics_for_systems_biologists.git
3.3.2.3 Cmdstanpy and cmdstan
from cmdstanpy import CmdStanModel

# write a minimal Stan program to a file, then compile it and sample
filename = "example_stan_program.stan"
code = "data {} parameters {real t;} model {t ~ std_normal();}"
with open(filename, "w") as f:
    f.write(code)
model = CmdStanModel(stan_file=filename)
mcmc = model.sample()
4 Next time
4.0.0.1 Theory
Hamiltonian Monte Carlo:
- what?
- why?
MCMC diagnostics
4.0.0.2 Computer
Stan, cmdstanpy, arviz:
- formats
- workflow
- write a model